Text2Action: Generative Adversarial Synthesis from Language to Action
In this paper, we propose a generative model which learns the relationship
between language and human action in order to generate a human action sequence
given a sentence describing human behavior. The proposed generative model is a
generative adversarial network (GAN) based on the sequence-to-sequence
(seq2seq) model. Using the proposed generative network, we can
synthesize various actions for a robot or a virtual agent using a text encoder
recurrent neural network (RNN) and an action decoder RNN. The proposed
generative network is trained from 29,770 pairs of actions and sentence
annotations extracted from MSR-Video-to-Text (MSR-VTT), a large-scale video
dataset. We demonstrate that the network can generate human-like actions which
can be transferred to a Baxter robot, such that the robot performs an action
based on a provided sentence. Results show that the proposed generative network
correctly models the relationship between language and action and can generate
a diverse set of actions from the same sentence.
Comment: 8 pages, 10 figures
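The text-encoder/action-decoder pipeline described above can be sketched as follows. This is a minimal NumPy mock-up, not the authors' implementation: the random weights stand in for trained parameters, and all dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper).
EMB, HID, POSE = 8, 16, 10

# Randomly initialized weights stand in for trained parameters.
W_xh = rng.normal(0, 0.1, (EMB, HID))
W_hh = rng.normal(0, 0.1, (HID, HID))
W_dec_in = rng.normal(0, 0.1, (POSE, HID))
W_dec_hh = rng.normal(0, 0.1, (HID, HID))
W_out = rng.normal(0, 0.1, (HID, POSE))

def encode(word_embs):
    """Vanilla RNN text encoder: fold word embeddings into a context vector."""
    h = np.zeros(HID)
    for x in word_embs:
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h

def decode(context, steps):
    """Autoregressive action decoder: each generated pose feeds the next step."""
    h, pose = context, np.zeros(POSE)
    frames = []
    for _ in range(steps):
        h = np.tanh(pose @ W_dec_in + h @ W_dec_hh)
        pose = h @ W_out
        frames.append(pose)
    return np.stack(frames)

sentence = rng.normal(size=(5, EMB))   # 5 "word" embeddings
action = decode(encode(sentence), steps=20)
print(action.shape)  # (20, 10): 20 frames of a 10-D pose
```

In the full model, a GAN discriminator would score such decoded sequences against real motion data during training; only the generator path is sketched here.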
Interactive Text2Pickup Network for Natural Language based Human-Robot Collaboration
In this paper, we propose the Interactive Text2Pickup (IT2P) network for
human-robot collaboration, which enables effective interaction with a human
user despite ambiguity in the user's commands. We focus on the task where a
robot is expected to pick up an object instructed by a human, and to interact
with the human when the given instruction is vague. The proposed network
understands the command from the human user and estimates the position of the
desired object first. To handle the inherent ambiguity in human language
commands, a suitable question which can resolve the ambiguity is generated. The
user's answer to the question is combined with the initial command and given
back to the network, resulting in more accurate estimation. The experiment
results show that given unambiguous commands, the proposed method can estimate
the position of the requested object with an accuracy of 98.49% based on our
test dataset. Given ambiguous language commands, we show that the accuracy of
the pick-up task increases by a factor of 1.94 after incorporating the
information obtained from the interaction.
Comment: 8 pages, 9 figures
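The ask-then-refine loop can be illustrated with a toy scoring model. The linear scoring function, confidence threshold, and additive feature fusion below are illustrative assumptions, not the IT2P architecture:

```python
import numpy as np

def estimate(command_feat, objects):
    """Score each candidate object against the command; softmax -> confidence."""
    scores = objects @ command_feat
    p = np.exp(scores - scores.max())
    return p / p.sum()

def pick(command_feat, objects, answer_feat=None, threshold=0.6):
    """One round of the ask-then-refine loop: if the best score is below the
    threshold, fold the user's answer into the command and re-estimate."""
    p = estimate(command_feat, objects)
    if p.max() < threshold and answer_feat is not None:
        p = estimate(command_feat + answer_feat, objects)
    return int(p.argmax()), float(p.max())

# Toy 2-D object features and command/answer embeddings.
objects = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
vague = np.array([0.5, 0.5])        # ambiguous command: all objects score alike
answer = np.array([0.0, 2.0])       # disambiguating answer from the user
idx, conf = pick(vague, objects, answer)
print(idx, conf > 0.6)
```

The real network generates the clarifying question itself; here the answer feature is simply given.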
A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis
The synthesis of human motion has traditionally been addressed through
task-dependent models that focus on specific challenges, such as predicting
future motions or filling in intermediate poses conditioned on known key-poses.
In this paper, we present a novel task-independent model called UNIMASK-M,
which can effectively address these challenges using a unified architecture.
Our model obtains performance comparable to or better than the
state-of-the-art in each field. Inspired by Vision Transformers (ViTs), our
UNIMASK-M model
decomposes a human pose into body parts to leverage the spatio-temporal
relationships existing in human motion. Moreover, we reformulate various
pose-conditioned motion synthesis tasks as a reconstruction problem with
different masking patterns given as input. By explicitly informing our model
about the masked joints, our UNIMASK-M becomes more robust to occlusions.
Experimental results show that our model successfully forecasts human motion on
the Human3.6M dataset. Moreover, it achieves state-of-the-art results in motion
inbetweening on the LaFAN1 dataset, particularly in long transition periods.
More information can be found on the project website
https://sites.google.com/view/estevevallsmascaro/publications/unimask-m
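The reformulation of motion tasks as masked reconstruction can be illustrated as follows. The mask-as-extra-channel encoding is one illustrative way to "explicitly inform the model about the masked joints", not the paper's exact input format:

```python
import numpy as np

def apply_mask(motion, mask):
    """Zero out masked entries and append the mask itself as an extra input
    channel, so the model is explicitly told which entries are missing."""
    masked = np.where(mask, 0.0, motion)
    return np.concatenate([masked, mask.astype(motion.dtype)], axis=-1)

T, J = 10, 4                       # frames x per-frame pose dims (toy sizes)
motion = np.arange(T * J, dtype=float).reshape(T, J)

# Motion prediction: the future frames are masked and must be reconstructed.
predict_mask = np.zeros((T, J), dtype=bool); predict_mask[6:] = True
# Motion inbetweening: frames between two known key-poses are masked.
between_mask = np.zeros((T, J), dtype=bool); between_mask[3:7] = True

pred_in = apply_mask(motion, predict_mask)
betw_in = apply_mask(motion, between_mask)
print(pred_in.shape)                               # (10, 8): pose + mask channel
print(pred_in[7, :J].sum(), betw_in[4, :J].sum())  # masked frames zeroed
```

A single reconstruction model trained over many such masking patterns is what makes the architecture task-independent.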
Generative Autoregressive Networks for 3D Dancing Move Synthesis from Music
This paper proposes a framework that is able to generate a sequence of
three-dimensional human dance poses for a given piece of music. The proposed
framework
consists of three components: a music feature encoder, a pose generator, and a
music genre classifier. We focus on integrating these components to generate
realistic 3D human dancing moves from music, which can be applied to
artificial agents and humanoid robots. The trained dance pose generator, which
is a generative autoregressive model, is able to synthesize a dance sequence
longer than 5,000 pose frames. Experimental results of generated dance
sequences from various songs show how the proposed method generates human-like
dancing moves for given music. In addition, a generated 3D dance sequence is
applied to a humanoid robot, showing that the proposed framework can make a
robot dance just by listening to music.
Comment: 8 pages, 10 figures
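The autoregressive rollout underlying such a pose generator can be sketched as below. The random-weight recurrent cell and all dimensions are illustrative stand-ins for the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
MUSIC, POSE, HID = 6, 12, 24       # toy feature/pose/state sizes

W_m = rng.normal(0, 0.1, (MUSIC, HID))
W_p = rng.normal(0, 0.1, (POSE, HID))
W_h = rng.normal(0, 0.1, (HID, HID))
W_o = rng.normal(0, 0.1, (HID, POSE))

def generate(music_feats):
    """Autoregressive rollout: each pose is conditioned on the current music
    feature and the previously generated pose, so arbitrarily long sequences
    can be produced frame by frame."""
    h, pose, seq = np.zeros(HID), np.zeros(POSE), []
    for m in music_feats:
        h = np.tanh(m @ W_m + pose @ W_p + h @ W_h)
        pose = h @ W_o
        seq.append(pose)
    return np.stack(seq)

music = rng.normal(size=(5000, MUSIC))   # one music feature vector per frame
dance = generate(music)
print(dance.shape)  # (5000, 12): a rollout as long as the music input
```

Because the pose feeds back into the state, the same mechanism scales to the 5,000+ frame sequences reported in the paper.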
Self-Supervised Motion Retargeting with Safety Guarantee
In this paper, we present self-supervised shared latent embedding (S3LE), a
data-driven motion retargeting method that enables the generation of natural
motions in humanoid robots from motion capture data or RGB videos. While the
method requires paired data consisting of human poses and their corresponding
robot configurations, it significantly alleviates the need for time-consuming
data collection via novel paired-data generation processes. Our self-supervised
learning procedure consists of two steps: automatically generating paired data
to bootstrap the motion retargeting, and learning a projection-invariant
mapping to handle the different expressivity of humans and humanoid robots.
Furthermore, our method guarantees that the generated robot pose is
collision-free and satisfies position limits by utilizing nonparametric
regression in the shared latent space. We demonstrate that our method can
generate expressive robotic motions from both the CMU motion capture database
and YouTube videos.
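The safety mechanism — nonparametric regression onto pre-verified configurations in a shared latent space — can be illustrated with a toy Nadaraya-Watson estimator. The Gaussian kernel, bandwidth, and data here are illustrative; the paper's estimator may differ:

```python
import numpy as np

def safe_decode(z, safe_latents, safe_configs, bandwidth=0.5):
    """Kernel regression onto pre-verified robot configurations. The output is
    a convex combination of collision-free, in-limit poses, so it inherits
    their per-joint position bounds."""
    d2 = ((safe_latents - z) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    w /= w.sum()
    return w @ safe_configs

# Toy latent anchors paired with verified-safe joint configurations.
safe_latents = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
safe_configs = np.array([[0.1, 0.2], [0.3, 0.2], [0.1, 0.4]])

q = safe_decode(np.array([0.5, 0.5]), safe_latents, safe_configs)
print(np.all((q >= safe_configs.min(0)) & (q <= safe_configs.max(0))))  # True
```

Since the weights are non-negative and sum to one, the decoded pose can never leave the box spanned by the verified configurations — a simple route to a position-limit guarantee.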
Human-Object Interaction Prediction in Videos through Gaze Following
Understanding the human-object interactions (HOIs) from a video is essential
to fully comprehend a visual scene. This line of research has been addressed by
detecting HOIs from images and lately from videos. However, the video-based HOI
anticipation task in the third-person view remains understudied. In this paper,
we design a framework to detect current HOIs and anticipate future HOIs in
videos. We propose to leverage human gaze information since people often fixate
on an object before interacting with it. These gaze features together with the
scene contexts and the visual appearances of human-object pairs are fused
through a spatio-temporal transformer. To evaluate the model in the HOI
anticipation task in a multi-person scenario, we propose a set of person-wise
multi-label metrics. Our model is trained and validated on the VidHOI dataset,
which contains videos capturing daily life and is currently the largest video
HOI dataset. Experimental results in the HOI detection task show that our
approach improves the baseline by a large relative margin of 36.3%. Moreover,
we conduct an extensive ablation study to demonstrate the effectiveness of our
modifications and extensions to the spatio-temporal transformer. Our code is
publicly available at https://github.com/nizhf/hoi-prediction-gaze-transformer.
Comment: Accepted by CVIU, https://doi.org/10.1016/j.cviu.2023.10374
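The fusion of gaze, scene-context, and appearance features can be caricatured with a single self-attention step over three feature tokens. This is a toy stand-in, not the paper's spatio-temporal transformer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse(appearance, context, gaze):
    """Single-head self-attention over the three feature tokens, letting each
    cue reweight the others before pooling into one fused feature."""
    tokens = np.stack([appearance, context, gaze])          # (3, D)
    attn = softmax(tokens @ tokens.T / np.sqrt(tokens.shape[1]))
    return (attn @ tokens).mean(axis=0)                     # fused (D,)

D = 8
rng = np.random.default_rng(2)
fused = fuse(rng.normal(size=D), rng.normal(size=D), rng.normal(size=D))
print(fused.shape)  # (8,)
```

The actual model additionally attends across time and across human-object pairs; this sketch only shows the per-step cue fusion.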
Robust Human Motion Forecasting using Transformer-based Model
Comprehending human motion is a fundamental challenge for developing
Human-Robot Collaborative applications. Computer vision researchers have
addressed this field by focusing only on reducing prediction error, without
taking into account the requirements for deployment on robots. In this paper,
we propose a new Transformer-based model that simultaneously addresses
real-time 3D human motion forecasting in the short and long term. Our
2-Channel Transformer (2CH-TR) is able to efficiently
exploit the spatio-temporal information of a shortly observed sequence (400ms)
and achieves accuracy competitive with the current state-of-the-art.
2CH-TR stands out for the efficient performance of the Transformer, being
lighter and faster than its competitors. In addition, our model is tested in
conditions where the human motion is severely occluded, demonstrating its
robustness in reconstructing and predicting 3D human motion in a highly noisy
environment. Our experiment results show that the proposed 2CH-TR outperforms
the ST-Transformer, which is another state-of-the-art model based on the
Transformer, in terms of reconstruction and prediction under the same
conditions of input prefix. Our model reduces the mean squared error of
ST-Transformer by 8.89% in short-term prediction and by 2.57% in long-term
prediction on the Human3.6M dataset with a 400ms input prefix.
Comment: This paper has already been accepted to the 2022 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS 2022)
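The two-channel idea — attending over time and over the pose dimensions in parallel — can be sketched as follows. This toy version uses plain unparameterized attention, whereas the real 2CH-TR is a full Transformer with learned projections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    """Plain scaled dot-product self-attention over the first axis."""
    a = softmax(x @ x.T / np.sqrt(x.shape[1]))
    return a @ x

def two_channel(motion):
    """Temporal channel: frames attend to frames. Spatial channel: pose
    dimensions attend to pose dimensions. The two views are computed in
    parallel and summed."""
    temporal = attend(motion)            # (T, J)
    spatial = attend(motion.T).T         # (T, J)
    return temporal + spatial

T, J = 10, 6                             # a short observed prefix, toy pose size
motion = np.random.default_rng(3).normal(size=(T, J))
out = two_channel(motion)
print(out.shape)  # (10, 6)
```

Splitting spatial and temporal attention into separate channels keeps each attention matrix small, which is consistent with the paper's emphasis on a lighter, faster model.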
Visually Grounding Instruction for History-Dependent Manipulation
This paper emphasizes the importance of a robot's ability to refer to its task
history when it executes a series of pick-and-place manipulations by following
text instructions given one by one. The advantage of referring to the
manipulation history is twofold: (1) instructions that omit details or use
co-referential expressions can be interpreted, and (2) the visual information
of objects occluded by previous manipulations can be inferred. For this
challenge, we introduce the task of history-dependent manipulation, which is
to visually ground a series of text instructions for proper manipulations
depending on the task history. We also present a relevant dataset and a deep
neural network-based methodology, and show that our network trained on a
synthetic dataset can be applied to the real world using images translated
into a synthetic style with CycleGAN.
Comment: 8 pages, 6 figures
Expression, Immobilization and Enzymatic Properties of Glutamate Decarboxylase Fused to a Cellulose-Binding Domain
Escherichia coli-derived glutamate decarboxylase (GAD), an enzyme that catalyzes the conversion of glutamic acid to gamma-aminobutyric acid (GABA), was fused to the cellulose-binding domain (CBD) and a linker of Trichoderma harzianum endoglucanase II. To prevent proteolysis of the fusion protein, the native linker was replaced with an S3N10 peptide known to be completely resistant to E. coli endopeptidase. The CBD-GAD expressed in E. coli was successfully immobilized on Avicel, a crystalline cellulose, with a binding capacity of 33 ± 2 nmolCBD-GAD/gAvicel, and the immobilized enzymes retained 60% of their initial activities after 10 uses. The results of this report provide a feasible alternative for producing GABA using immobilized GAD through fusion to CBD.
Visually Grounding Language Instruction for History-Dependent Manipulation
This paper emphasizes the importance of a robot's ability to refer to its task history, especially when it executes a series of pick-and-place manipulations by following language instructions given one by one. The advantage of referring to the manipulation history is twofold: (1) language instructions that omit details but use expressions referring to the past can be interpreted, and (2) the visual information of objects occluded by previous manipulations can be inferred. For this, we introduce a history-dependent manipulation task whose objective is to visually ground a series of language instructions for proper pick-and-place manipulations by referring to the past. We also present a relevant dataset and a baseline model, and show that our model trained with the proposed dataset can also be applied to the real world using CycleGAN. Our dataset and code are publicly available on the project website: https://sites.google.com/view/history-dependent-manipulation